This class prep will introduce to some of the material for the next class. There are some optional tasks at the end of this notebook that you can choose to complete if you would like to. We will use this as the starting point for the next class.
Lets make a beautiful chart
We will be using the tidyverse and nycflights13 packages for this. The first step as always is to load these from our package library using the library() function.
library(tidyverse)
library(nycflights13)
We need a chart to improve upon. Lets create something to study the average hourly departure delay for the three airports in NYC. Lets start by creating the data for this plot. To do this, I need to group the flights data by the hour and origin variables and then summarise to calculate the hourly mean departure delay for each origin airport. See the code and output below. For every hour (with a flight) and origin data group we calculate the average departure delay.
Keep the data above in your mind. Now lets visualize the same data using a line chart. We do this using a ggplot function that takes the plotData as the input. This time I map the aethetics within the geom_line() function.
ggplot(data = plotData) +geom_line(mapping =aes(x = hour, y = meanDelay))
geom_line() connects observations in the order of the variable on the x axis
Connects obs. in order of the variable on the x -axis. This means that the geom_line() function is moving from the left-side of the x-axis to the right and looking for corresponding values of the y variable (meanDelay) to plot. There are no values of meanDelay for 1:00 AM till 5:00 AM so there is nothing to plot until 5:00 AM. 1 When it gets to hour 5 on the x-axis it see 3 values for meanDelay (one for each origin airport LGA, EWR and JFK) so it plots those. All of these will appear in the same vertical line since x-axis value for all three values is the same i.e. 5. It similarly moves through all the x-axis values to plot all the y-variable values one after the other.
We can see this more clearly by overlaying points to this chart.
ggplot(data = plotData) +geom_line(mapping =aes(x = hour, y = meanDelay)) +geom_point(mapping =aes(x = hour, y = meanDelay, colour = origin))
Look at the code carefully. There are a few things to point out.
The mapping is applied within each geom function. Notice how that the x and y aesthetics are repeated across the two geoms. We will improve this in the next step by moving common aesthetics to the ggplot() function
Notice the use of the colour aesthetic in the mapping argument for the geom_point() function. I have added this additional aesthetic and mapped it to the origin variable to group the dots by origin and give each origin value a different color. Will see how we can advance this to draw the line chart correctly.
Let improve our code before moving on and correcting the line chart. Notice how I have updated the code from above by moving the common aesthetics that apply to both the geom function into the ggplot function. I have also mapped an additional aesthetic to the color variable within the geom_point() function. The chart from this code is the exact same as the previous step (but the code is better).
geom_line() connects observations in the order of the variable on the x axis
The correct output that we want to see are three lines - one for each origin airport. So for instance, all the red dots for EWR should have a single line that goes through it and similarly for JFK and LGA dots. So how do we tell geom_line() function to go in order of the variable on x-axis but within each origin group. We can do this by setting the group aesthetic in the mapping argument for geom_line(). Lets try that out.
Much better! Now we have three line one for each origin. Each of those going through the dots corresponding to those origins. But lets not stop here. What if we want each line to have a color of its own. We can do this easily by mapping the color aesthetic to the origin. Lets try it.
Beautiful. The keen eyed amongst you might have already spotted how we can improve this code by moving the repeated aesthetic for color to the ggplot() function. Lets do that.
We can improve our code a little bit more. Since we applied the color aesthetic to origin, ggplot() already knows that the data is grouped by origin. This means that the grouping aesthetic within geom_line() is redundant. Lets remove that.
Now we are ready to start making this plot look better. The first thing I want to do is remove those warning that are displayed at the top of the plot. These are telling us that we have one row that had a value that could not be plotted. Can you look at the data we printed out earlier to guess which row that might be?
These warnings can be removed easily by adding a na.rm = T argument to the geom functions. This removes any data points that might be missing a value for an aesthetic that has been mapped. So in our case it removes those rows that are missing either a value for origin or for hour.
Lets also change the characteristics of the dots by changing some of the default values of aesthetics that have not been mapped to a variable. So for instance, each dot has an opacity value that is controlled by the alpha aesthetic which defaults to 1, and a default radius that is controlled by size. Lets change these to adjust their look and feel.
In the code above I reduced the opacity of the dots and increased their size. Notice how I haven’t used an aes() function to set the values of alpha and size. This is because aes() is exclusively used to map data to aesthetics. In this case, I have reassigned the default values of the aesthetic instead of mapping these to variables.
Now lets change the axis titles so that they are more useful to the viewer. You can do this using the labs() function.
ggplot(data = plotData, mapping =aes(x = hour, y = meanDelay, colour = origin)) +geom_line(na.rm = T) +geom_point(alpha =0.5, size =2.5, na.rm = T) +labs(
x ="Hour of departure",
y ="Mean departure delay (in minutes)"
)
Now that we have our axis titled correctly, lets add a title. A good title succinctly captures a key takeaway from a chart that then leads on to your next piece of analysis. A chart that does not have a key takeaway is a waste of everyone’s time and should not be published! Lets take a look at a few examples.
BAD title: “Hourly Vs. Mean departure delay”. This title tells us nothing about the insight from this chart.
GOOD title: “Evening flights experience greater delays”. This title clearly summarizes the takeaway from this chart. The subsequent analysis might look into why this might be the case.
BAD title: “Departure delay by origin”. Tells me nothing about whats happening in the chart.
GOOD title: “On average EWR departures experience greater delays”. Clearly summarizes a takeaway. Subsequent analysis might be on whats wrong with EWR.
Lets keep this in mind and add a title to the labs function.
ggplot(data = plotData, mapping =aes(x = hour, y = meanDelay, colour = origin)) +geom_line(na.rm = T) +geom_point(alpha =0.5, size =2.5, na.rm = T) +labs(
x ="Hour of departure",
y ="Mean departure delay (in minutes)",
title ="Avoid evening flights out of NYC if possible"
)
You might have noticed that there are a lot of default choices that ggplot makes in terms of style. For instance, the background to our plot is grey, there are gridlines, the axis title background is white, the legend is plotted on the right etc. We can change all of this if you want to using the theme() function. For now, lets use it to change the font size of the axis text and title, and remove the minor gridlines.
ggplot(data = plotData, mapping =aes(x = hour, y = meanDelay, colour = origin)) +geom_line(na.rm = T) +geom_point(alpha =0.5, size =2.5, na.rm = T) +labs(
x ="Hour of departure",
y ="Mean departure delay (in minutes)",
title ="Avoid evening flights out of NYC if possible"
) +theme(
axis.text =element_text(size =14),
axis.title =element_text(size =16),
panel.grid.minor =element_blank()
)
Instead of having to set the different style aspects individually, you can also pick from a few different preconfigured themes that come with gggplot2. Lets pick the minimal theme.
ggplot(data = plotData, mapping =aes(x = hour, y = meanDelay, colour = origin)) +geom_line(na.rm = T) +geom_point(alpha =0.5, size =2.5, na.rm = T) +labs(
x ="Hour of departure",
y ="Mean departure delay (in minutes)",
title ="Avoid evening flights out of NYC if possible"
) +theme(
axis.text =element_text(size =14),
axis.title =element_text(size =16),
panel.grid.minor =element_blank()
) +theme_minimal()
Notice how the theme_minimal() has removed the changes that we made using the theme() function in the previous step. Each latest application of a theme function resets all overlapping style defaults that might have been set before it. So in our case, theme_minimal() overwrites the values we set for the minor gridlines and the axis text and title. Lets clean up our code and remove the theme() function since we are not using it.
ggplot(data = plotData, mapping =aes(x = hour, y = meanDelay, colour = origin)) +geom_line(na.rm = T) +geom_point(alpha =0.5, size =2.5, na.rm = T) +labs(
x ="Hour of departure",
y ="Mean departure delay (in minutes)",
title ="Avoid evening flights out of NYC if possible"
) +theme_minimal()
Our chart is looking good but I’d like to make it better.
I’d like to get rid of the legend and label the names of the origin right next to the lines themselves. To do this lets try using geom_text() to do this. Lets also remove the legend from our plot by setting its position to “none” within the theme() function.
Lets test this out.
ggplot(data = plotData, mapping =aes(x = hour, y = meanDelay, colour = origin)) +geom_line(na.rm = T) +geom_point(alpha =0.5, size =2.5, na.rm = T) +labs(
x ="Hour of departure",
y ="Mean departure delay (in minutes)",
title ="Avoid evening flights out of NYC if possible"
) +theme_minimal() +geom_text(mapping =aes(label = origin, color = origin), na.rm = T) +theme(
legend.position ="none"
)
AAAHHH! Thats not what we wanted. Since I have given the entire dataset to the geom_text() function it has plotted all those values on to the plot. What we want however, is a single value for each line that is plotted at the end of the line. Lets try feeding it data with only the most recent hour of data.
ggplot(data = plotData, mapping =aes(x = hour, y = meanDelay, colour = origin)) +geom_line(na.rm = T) +geom_point(alpha =0.5, size =2.5, na.rm = T) +labs(
x ="Hour of departure",
y ="Mean departure delay (in minutes)",
title ="Avoid evening flights out of NYC if possible"
) +theme_minimal() +geom_text(
data = plotData %>%filter(hour ==max(hour, na.rm = T)),
aes(label = origin, colour = origin)
) +theme(
legend.position ="none"
)
Interesting. Now its plotting the labels for EWR and JFK at the end of the line but not for LGA. Lets look at the data a bit more closely. The table below is the original plot data that we have been using so far. We want to filter this to only include the last hour of the data.
plotData
The table below shows that there were no flights that took of from LGA for the last hour in the data. This means that we need to update our code to filter the last hour for each origin.
Good. This is the data that should be used to plot the labels. Lets try this out.
Much better. Lets now nudge the labels vertically and replace the geom_text() with geom_label(). Lets also make font bold.
ggplot(data = plotData, mapping =aes(x = hour, y = meanDelay, colour = origin)) +geom_line(na.rm = T) +geom_point(alpha =0.5, size =2.5, na.rm = T) +labs(
x ="Hour of departure",
y ="Mean departure delay (in minutes)",
title ="Avoid evening flights out of NYC if possible"
) +theme_minimal() +geom_label(
data = plotData %>%group_by(origin) %>%filter(hour ==max(hour, na.rm = T)),
aes(label = origin, colour = origin),
vjust =1.5,
fontface ="bold",
alpha =0.8
) +theme(
legend.position ="none"
)
Even he is impressed!
Now the chart is publication ready. The final part is to save it. We can do this using the ggsave() function. Plots can saved in the following formats: “eps”, “ps”, “tex” (pictex), “pdf”, “jpeg”, “tiff”, “png”, “bmp”, “svg” or “wmf” (windows only). If you use a filename that includes the extension, the ggsave will automatically infer the preferred format and use that. If you don’t use the extension you can explicitly set it using the “device” argument. By default, ggsave() saves the last plot that was created in your notebook or script. The file is saved in the folder where you notebook is stored.
Calculate the average monthly departure delay for each origin and use that data to recreate the same chart as above (the only difference is that you will be using month instead of hour for x aesthetic)
Install the ggthemes package and load it using library()
Apply the theme_fivethirtyeight() instead of theme_minimal()
Harder task: theme_fivethirtyeight() removes the axis titles that we created. Can you use google to find an answer for how to bring this back?
The value for hour 1 is NaN because there is a single flight that took from EWR at 1:00 AM, however, the dep_delay for that flight is missing i.e. NA and mean(NA, na.rm = T) returns NaN. As a result the the line chart is empty until hour 5.↩